Skip to content

perf(vectors): batch per-file embeddings + offload parse/enrich#334

Merged
HumanBean17 merged 3 commits into
masterfrom
feat/vectors-stage-batching
Jun 21, 2026
Merged

perf(vectors): batch per-file embeddings + offload parse/enrich#334
HumanBean17 merged 3 commits into
masterfrom
feat/vectors-stage-batching

Conversation

@HumanBean17

@HumanBean17 HumanBean17 commented Jun 20, 2026

Copy link
Copy Markdown
Owner

Summary

The vectors stage (the embedding phase of init / increment / reprocess) underused the embedding model's batching. process_*_file embedded each chunk serially, so SentenceTransformerEmbedder (@coco.fn.as_async(batching=True, max_batch_size=64)) only ever saw batches of 1 — batching only groups concurrently in-flight calls, and a serial await loop keeps one in flight. Worst for increment (few changed files → no cross-file batching).

Levers (java_index_flow_lancedb.py)

  • AST by Opus #1 — concurrent per-file embedding (java/sql/yaml). asyncio.gather(*(embedder.embed(ch.text) for ch in chunks)) → one model.encode([...]) (up to 64) instead of N batch-of-1 encodes.
  • Tier 1 completion plan + proposals (B2a + B4 + B5) #2 — offload parse + enrich off the event loop (java). asyncio.to_thread(_parse_and_enrich_java, ...) so the loop keeps the embedder's batching queue fed while a file parses (matters under init's high file concurrency). splitter.split stays inline (shared Rust singleton).

Review-driven fixes (two code-review rounds)

  • ast_java.py: parse_java is now thread-safe via a per-thread Parser (threading.local()). Round 1 found that Tier 1 completion plan + proposals (B2a + B4 + B5) #2 made enrich_chunk concurrent, and enrich transitively calls parse_java (collect_annotation_meta_chain → _collect_annotation_decl_index), which used the @lru_cache singleton tree-sitter Parser — not thread-safe. Root-cause fix at the source covers every caller (flow, enrich, graph builder). Added tests/test_ast_java_thread_safety.py.
  • app_main: warm per-project enrich caches before fanning out to threads. Round 2 found that concurrent enrich caused a collect_annotation_meta_chain thundering herd (up to ~14 worker threads each cold-missing and redundantly walk+parseing the whole project at cold start). Warming once on the event loop eliminates it.

Correctness — verified clean

Two full review rounds (8 finder angles + 4 deep concurrency verifiers, with empirical 16-thread × 2000-parse stress tests). Zero remaining races: the per-thread Parser covers every parse_java call site; an exhaustive transitive walk of enrich_chunk's callees found no other non-thread-safe shared state (pure functions, immutable frozenset constants, read-only lru_cache returns, ast/chunks never mutated). declare_row buffers → commit is all-or-nothing via the cocoindex memo guard, so no partial writes on mid-file failure.

Behaviour / compatibility

  • Output is equivalent: same chunk texts → same normalized vectors; rows still declared in chunk order; uuids stay random. No schema/ontology/env change.
  • First run after upgrade reprocesses once (process_java_file is @coco.fn(memo=True), changed body invalidates memo entries), then returns to incremental.

Validation

  • ruff check . — clean.
  • .venv/bin/python -m pytest tests -q839 passed, 13 skipped (heavy-gated).
  • Heavy e2e JAVA_CODEBASE_RAG_RUN_HEAVY=1 ... test_lancedb_e2e.py — main path passes (full cocoindex update --full-reprocess + graph + MCP search).
  • test_lancedb_ignore_file_reduces_indexed_java_files fails — pre-existing on master (verified by stashing the change and reproducing the identical FileExistsError; the test's own mkdir collides with a tracked fixture file). Unrelated; left untouched per scope.

🤖 Generated with Claude Code

HumanBean17 and others added 3 commits June 20, 2026 23:42
process_*_file embedded each chunk serially (one `await embedder.embed()`
per chunk in a for-loop), so the batched SentenceTransformerEmbedder
(max_batch_size=64) only ever saw batches of 1. This wasted its batching
within a file and hurt `increment` most: few changed files means little
cross-file concurrency, so embeddings ran one-at-a-time.

#1 — embed all chunks of a file concurrently via asyncio.gather so the
embedder groups them into one model.encode(...). Applied to java/sql/yaml.

#2 — for java, move parse_java + per-chunk enrich_chunk off the event loop
(asyncio.to_thread) so the loop keeps the embedder's batching queue fed
while a file is being parsed. parse_java is serialized by _PARSE_LOCK
because it reuses the lru-cached singleton tree-sitter Parser, whose
parse() mutates state and is not thread-safe. splitter.split stays inline
(shared Rust RecursiveSplitter singleton); enrich_chunk is pure-Python over
the immutable AST (lru_cache reads are GIL-safe) and runs without the lock.

Output is equivalent: same chunk texts → same normalized vectors; rows are
still declared in chunk order; uuids stay random. No schema/ontology/env
change. Because process_java_file is @coco.fn(memo=True), the changed body
invalidates memo entries, so the first run after upgrade reprocesses
(equivalent to a full reindex) — subsequent runs are incremental again.

Verified: ruff clean; 838 light tests pass; heavy test_lancedb_e2e (real
cocoindex update --full-reprocess + graph + search) passes. One heavy test
(test_lancedb_ignore_file_reduces_indexed_java_files) fails, but it fails
identically on clean master (FileExistsError in the test's own mkdir,
unrelated to this change).

Co-Authored-By: Claude <noreply@anthropic.com>
Code review of the vectors perf change found a race it introduced:
_parse_and_enrich_java runs enrich_chunk outside _PARSE_LOCK on concurrent
worker threads, and enrich_chunk -> collect_annotation_meta_chain (lru_cached,
non-serializing) -> _collect_annotation_decl_index -> parse_java. On a cold
cache, multiple threads hit parse_java concurrently on the shared @lru_cache
tree-sitter Parser, whose parse() mutates state and is not thread-safe ->
corrupt AST or native segfault. (Before the perf change, enrich ran on the
single event-loop thread, so this was serialized.)

Root-cause fix at the source: _parser() now returns a per-thread Parser
(threading.local) instead of a process-wide lru_cached singleton. Each OS
thread gets its own Parser; the Language is immutable and shared. This
protects EVERY parse_java caller — the flow's direct parse, enrich's
transitive parse, and build_ast_graph — and preserves parse parallelism
rather than serializing it. Removes the now-redundant _PARSE_LOCK from the
flow file.

Added tests/test_ast_java_thread_safety.py: 16 threads × 60 parses must each
match the single-threaded reference (locks the invariant in; a revert to a
shared Parser would corrupt/crash here).

Verified: ruff clean; 839 light tests pass (+1 new); heavy test_lancedb_e2e
main path (full reprocess + graph + search) passes. The pre-existing
test_lancedb_ignore_file_reduces_indexed_java_files failure is unchanged
(its own mkdir bug, fails on master).

Co-Authored-By: Claude <noreply@anthropic.com>
…hreads

Second code-review round found a perf regression introduced by lever #2:
making enrich_chunk concurrent meant the first wave of worker threads each
cold-miss collect_annotation_meta_chain (lru_cached, non-serializing) and
each redundantly walk+parse the ENTIRE project (~min(32,cpu+4)x redundant
full-project parses at cold start, GIL-serialized into the critical path) —
directly offsetting the embedding-batching win on large repos.

Fix: warm collect_annotation_meta_chain + load_brownfield_overrides ONCE on
the event-loop thread in app_main, before coco.mount_each fans files into
worker threads. Key derivation mirrors enrich_chunk exactly so the warmed
entries are the ones workers hit; with warming every worker gets a cache
hit (lru_cache reads are thread-safe). Wrapped so a warm-up failure never
breaks indexing (falls back to lazy per-thread misses = prior behavior).

The same review round confirmed ZERO correctness races remain: the per-thread
Parser covers every parse_java call site (incl. enrich's transitive parse),
and an exhaustive walk of enrich_chunk's callees found no other non-thread-safe
shared state. Verified: ruff clean; heavy test_lancedb_e2e main path passes.

Co-Authored-By: Claude <noreply@anthropic.com>
@HumanBean17 HumanBean17 merged commit 9b1326b into master Jun 21, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant